DOC GH17505 Added some links and examples. Too little/much/wrong? #17908
Codecov Report
@@ Coverage Diff @@
## master #17908 +/- ##
==========================================
- Coverage 91.23% 91.22% -0.02%
==========================================
Files 163 163
Lines 50105 50105
==========================================
- Hits 45715 45706 -9
- Misses 4390 4399 +9
df.groupby(['A', 'B']).sum().reset_index()
``count``, Number of non-NA observations
these can be links as well
The links to the functions available with aggregate? I think that would be a great idea. Where can I find the list of available functions and the shortcuts? I figured that must have been documented elsewhere at some point, but I couldn't find it.
can you post a rendered version of this page? since you are doing lots of changes, it's hard to see what the new version would look like.
Group By: split-apply-combine

Split-apply-combine is a common paradigm in data analysis. It involves splitting

The split step is the most straightforward. See the section on

In the apply step you may wish to apply one of the following operations:

Pandas also supports iteration over the groups created in the split step; Using
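The three steps can be made explicit by iterating over the groups by hand. This is a minimal sketch, not part of the PR: the toy DataFrame and the names `manual` and `combined` are made up for illustration, and only the `day` and `total_bill` column names are borrowed from the tips data.

```python
import pandas as pd

# Hypothetical toy data; it only borrows two column names from the tips data.
df = pd.DataFrame({
    "day": ["Fri", "Sat", "Fri", "Sun", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Split: iterating over a GroupBy yields (group name, sub-DataFrame) pairs.
# Apply: compute the mean total bill of each group.
manual = {}
for name, group in df.groupby("day"):
    manual[name] = group["total_bill"].mean()

# Combine: collect the per-group results into a single Series.
manual = pd.Series(manual, name="total_bill")

# The same split, apply and combine steps expressed as one call.
combined = df.groupby("day")["total_bill"].mean()
```

Both `manual` and `combined` hold one mean per day. The explicit loop is mostly useful for inspecting groups; the single call is the usual way to write it.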
Aggregation

This section describes how to aggregate data. We will be giving examples using the

    In [54]: tips = pd.read_csv('./data/tips.csv')

    In [55]: tips
    Out[55]:
         total_bill   tip     sex smoker   day    time  size
    0         16.99  1.01  Female     No   Sun  Dinner     2
    1         10.34  1.66    Male     No   Sun  Dinner     3
    2         21.01  3.50    Male     No   Sun  Dinner     3
    3         23.68  3.31    Male     No   Sun  Dinner     2
    4         24.59  3.61  Female     No   Sun  Dinner     4
    5         25.29  4.71    Male     No   Sun  Dinner     4
    6          8.77  2.00    Male     No   Sun  Dinner     2
    ..          ...   ...     ...    ...   ...     ...   ...
    237       32.83  1.17    Male    Yes   Sat  Dinner     2
    238       35.83  4.67  Female     No   Sat  Dinner     3
    239       29.03  5.92    Male     No   Sat  Dinner     3
    240       27.18  2.00  Female    Yes   Sat  Dinner     2
    241       22.67  2.00    Male    Yes   Sat  Dinner     2
    242       17.82  1.75    Male     No   Sat  Dinner     2
    243       18.78  3.00  Female     No  Thur  Dinner     2

    [244 rows x 7 columns]

What if we wanted to know the average total bill on each day? We split the data so that each group consists of all the meals eaten on the same day. We want a single value for each group, so we should use the aggregate function:

    In [56]: tips.groupby('day').aggregate('mean')
    Out[56]:
          total_bill       tip      size
    day
    Fri    17.151579  2.734737  2.105263
    Sat    20.441379  2.993103  2.517241
    Sun    21.410000  3.255132  2.842105
    Thur   17.682742  2.771452  2.451613

The result has the group names, in this case the days, as the index along the grouped axis. Along the other axis we have the columns for which Pandas could calculate a mean, i.e. the ones with a numeric data type. We could have selected the

How about the number of guests for each day and for each time of day? In this case it is not enough to split the data on the day it was eaten; we also need to split by the time of day. Instead of calculating the mean, like in the previous example we use the

    In [57]: tips.groupby(['day', 'time'])['size'].agg('sum')
    Out[57]:
    day   time
    Fri   Dinner     26
          Lunch      14
    Sat   Dinner    219
    Sun   Dinner    216
    Thur  Dinner      2
          Lunch     150
    Name: size, dtype: int64
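Since `./data/tips.csv` is not included here, the two aggregations above can be tried on a small stand-in. This is a sketch under that assumption: the five rows below are invented and only reuse the tips column names, so the numbers differ from the real output. It also shows the `reset_index()` pattern from the diff, which turns the group keys back into ordinary columns.

```python
import pandas as pd

# Invented stand-in for the tips data; the values are illustrative only.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
    "total_bill": [12.0, 18.0, 20.0, 30.0, 40.0],
    "size": [2, 3, 2, 4, 2],
})

# One key: average total bill per day. Selecting the column first keeps
# non-numeric columns such as "time" out of the aggregation.
mean_bill = tips.groupby("day")["total_bill"].agg("mean")

# Two keys: number of guests per (day, time) pair.
# The result is a Series with a two-level MultiIndex.
guests = tips.groupby(["day", "time"])["size"].agg("sum")

# reset_index() moves the group keys from the index back into columns.
flat = guests.reset_index()
```

Here `mean_bill` is indexed by day, `guests` by the (day, time) pairs that actually occur in the data, and `flat` is a plain DataFrame with the columns `day`, `time` and `size`.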
Pandas has support for a number of basic descriptive statistics functions which can be used with aggregate:

What if we need to know the difference between the smallest and largest total bill for each party size? Again we split the data, this time so that each group has the meals with the same party size. But which function do we use to find the difference? The

    In [58]: tips.groupby(['size']).agg(lambda group: max(group) - min(group))['total_bill']
    Out[58]:
    size
    1     7.00
    2    34.80
    3    40.48
    4    31.84
    5    20.50
    6    21.12
    Name: total_bill, dtype: float64
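A custom aggregation like the one above can also be written with the column selected before `agg`, so the function only ever sees numeric values. Applying a subtraction to string columns such as `sex` or `day` may fail or be silently dropped, depending on the pandas version. The sketch below uses invented data, and the helper name `value_range` is hypothetical, not a pandas function.

```python
import pandas as pd

# Invented stand-in for the tips data; the values are illustrative only.
tips = pd.DataFrame({
    "size": [2, 2, 3, 3, 3],
    "day": ["Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 25.0, 15.0, 40.0, 20.0],
})


def value_range(values):
    """Return the spread between the largest and smallest value in a group."""
    return values.max() - values.min()


# Select the numeric column first, then aggregate each group with the helper.
bill_range = tips.groupby("size")["total_bill"].agg(value_range)

# Several built-in aggregations can be requested at once by name.
summary = tips.groupby("size")["total_bill"].agg(["min", "max", "count"])
```

`bill_range` holds one spread per group, and `summary` has one column per requested function, which is often enough to avoid a custom function altogether.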
The formatting is not super but I hope it gives a better idea of what's been changed and how it would look. Is there a better way to do this?
@linebp can you post a rendered screenshot of this?
can you rebase and show a rendered screenshot?
@linebp sorry, we let this get away from us. happy to have some clarifications, but can you do this in a more targeted manner? IOW, more PRs with smaller changes is usually better.
I'll have another look and see if I can do a PR with a smaller change and a screenshot of the rendered changes.
git diff upstream/master -u -- "*.py" | flake8 --diff
Before I spend more time on this, I'd like to know if I am doing too much, too little, or getting it just plain all wrong.